AITopics | image feature

DON'T NEED RETRAINING: A Mixture of DETR and Vision Foundation Models for Cross-Domain Few-Shot Object Detection

Neural Information Processing Systems | Jun-22-2026, 22:11:22 GMT

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to generalize to unseen domains by leveraging a few annotated samples of the target domain, requiring models to exhibit both strong generalization and localization capabilities. However, existing well-trained detectors typically have strong localization capabilities but suffer from limited generalization, whereas vision foundation models (VFMs) generally exhibit better generalization but lack accurate localization capabilities. In this paper, we propose a novel Mixture-of-Experts (MoE) structure that integrates the detector's localization capability and the VFM's generalization by using VFM features to improve detector features. Specifically, we propose an Expert-wise Router (ER) that dynamically selects the most relevant VFM experts for each backbone layer, and a Region-wise Router (RR) that emphasizes foreground and suppresses background. To bridge representation gaps, we further propose a Shared Expert Projection (SEP) module and a Private Expert Projection (PEP) module, which align VFM features to the detector feature space while decoupling shared image features from private image features in the VFM feature map. Finally, we construct an MoE module to transfer the VFM's generalization to the detector without modifying the original detector architecture. Furthermore, our method extends well-trained detectors to detect novel classes in unseen domains without re-training on the base classes.
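
The expert-selection idea in this abstract can be illustrated with a generic top-k gating sketch. This is not the paper's Expert-wise Router: the function names (`top_k_gate`, `mix_experts`), the scalar router scores, the softmax over the selected scores, and the residual addition are all assumptions made for illustration.

```python
import math

def top_k_gate(scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def mix_experts(detector_feature, expert_features, gate):
    """Add the gate-weighted expert features to the detector feature (residual)."""
    mixed = list(detector_feature)
    for idx, weight in gate.items():
        for d, value in enumerate(expert_features[idx]):
            mixed[d] += weight * value
    return mixed

# Hypothetical router scores for three VFM experts at one backbone layer.
gate = top_k_gate([2.0, 0.5, 1.0], k=2)
fused = mix_experts([1.0, 1.0], [[0.2, 0.0], [9.0, 9.0], [0.0, 0.4]], gate)
```

In this toy call, experts 0 and 2 are selected and expert 1 contributes nothing, so the detector feature is adjusted only by the experts the gate considers relevant.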

large language model, machine learning, natural language, (21 more...)

Country: Asia > China (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Education (0.54)
Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Neural Information Processing Systems | Jun-22-2026, 12:07:08 GMT

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, in which the model generates descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor that contributes to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with prior methods.
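
The proxy described above (flag tokens whose representations move most under a small perturbation, then mask them) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: the L2 distance, the fixed `mask_ratio`, and the function names are hypothetical, and the perturbed features are supplied by hand rather than produced by an adversarial attack.

```python
import math

def token_deviation(clean_tokens, perturbed_tokens):
    """L2 distance between each token's clean and perturbed representation."""
    return [
        math.sqrt(sum((c - p) ** 2 for c, p in zip(clean, pert)))
        for clean, pert in zip(clean_tokens, perturbed_tokens)
    ]

def uncertain_token_mask(deviations, mask_ratio=0.25):
    """Return a keep-mask: False for the tokens with the largest deviations."""
    num_masked = int(len(deviations) * mask_ratio)
    ranked = sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)
    masked = set(ranked[:num_masked])
    return [i not in masked for i in range(len(deviations))]

# Toy early-layer token features before and after a small perturbation.
clean = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
perturbed = [[1.01, 0.0], [0.6, 0.2], [0.5, 0.52], [1.0, 0.99]]
keep = uncertain_token_mask(token_deviation(clean, perturbed))
```

A keep-mask of this kind could then be used to exclude the flagged tokens as attention keys in the chosen encoder layers; how the paper applies the mask in detail is not stated in the abstract.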

large language model, machine learning, natural language, (18 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Transportation > Passenger (0.67)
Leisure & Entertainment (0.67)
Transportation > Ground > Road (0.46)
Transportation > Ground > Rail (0.45)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

9796170d31d42b943534df40bdee68d3-Paper-Conference.pdf

Neural Information Processing Systems | Jun-20-2026, 04:37:55 GMT

Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
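
For orientation, the standard definition of conditional pointwise mutual information is given below. The symbols are assumptions chosen for this note ($y$ for the generated response, $v$ for the input image, $x$ for the text prompt); the paper's calibrated C-PMI objective may be formulated differently.

```latex
\mathrm{PMI}(y;\, v \mid x) \;=\; \log \frac{p(y \mid v, x)}{p(y \mid x)}
```

The quantity is positive when conditioning on the image makes the response more likely than the text prompt alone does, which is the dependency the abstract says the decoding strategy aims to strengthen.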

large language model, machine learning, natural language, (19 more...)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Industry:

Information Technology (0.93)
Transportation > Ground (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Neural Information Processing Systems | Jun-18-2026, 17:40:46 GMT

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface.

large language model, machine learning, segmentation, (20 more...)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Energy: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Neural Information Processing Systems | Jun-18-2026, 11:52:47 GMT

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically, by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named Energy.
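
The observation behind the score can be illustrated with a small sketch. It assumes the commonly used temperature-scaled log-sum-exp energy over image-to-text cosine similarities; the function names, the temperature of 0.01, the replacement value of 0.0, and the toy similarity vectors are hypothetical, and the paper's exact score may differ.

```python
import math

def energy(similarities, temperature=0.01):
    """Free energy -T * logsumexp(s / T) over the class similarities."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    return -temperature * (peak + math.log(sum(math.exp(z - peak) for z in scaled)))

def energy_change(similarities, low_value=0.0, temperature=0.01):
    """Energy after lowering the maximum similarity minus the original energy."""
    top = max(range(len(similarities)), key=lambda i: similarities[i])
    lowered = list(similarities)
    lowered[top] = low_value
    return energy(lowered, temperature) - energy(similarities, temperature)

# Toy cosine similarities to four class prompts.
closed_set_sample = [0.35, 0.18, 0.17, 0.16]  # one class clearly dominates
open_set_sample = [0.19, 0.18, 0.18, 0.17]    # no class stands out
delta_closed = energy_change(closed_set_sample)
delta_open = energy_change(open_set_sample)
```

In this toy setup, the closed-set sample shows a much larger energy change than the open-set sample, which is the kind of gap a threshold-based detector could exploit.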

detection, machine learning, natural language, (18 more...)

Country: Asia > China (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.65)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Neural Information Processing Systems | Jun-17-2026, 16:09:54 GMT

Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. Generation adjustment methods, though effective, often rely on heuristics and lack correction mechanisms, whereas post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 34% on HaloQuest.

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States (1.00)
Asia (0.67)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Industry:

Information Technology (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhancing via Cross Modality Alignment

Neural Information Processing Systems | Jun-15-2026, 06:41:15 GMT

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization, and they often overlook the gaps in CLIP's encoded features, which manifest as text and image features lying far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose CrOss-modaLity Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space.

artificial intelligence, machine learning, natural language, (20 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Neural Information Processing Systems | Jun-14-2026, 14:37:52 GMT

Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs).

artificial intelligence, machine learning, natural language, (19 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Promising Solution (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

DON'T NEED RETRAINING: A Mixture of DETR and Vision Foundation Models for Cross-Domain Few-Shot Object Detection

Neural Information Processing Systems | Jun-14-2026, 05:27:32 GMT

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to generalize to unseen domains by leveraging a few annotated samples of the target domain, requiring models to exhibit both strong generalization and localization capabilities. However, existing well-trained detectors typically have strong localization capabilities but lack generalization, whereas vision foundation models (VFMs) generally exhibit better generalization but lack accurate localization capabilities. In this paper, we propose a novel Mixture-of-Experts (MoE) structure that integrates the detector's localization capability and the VFM's generalization by using VFM features to improve detector features. Specifically, we propose an Expert-wise Router (ER) that selects the most relevant VFM experts for each backbone layer, and a Region-wise Router (RR) that emphasizes foreground and suppresses background. To bridge representation gaps, we further propose a Shared Expert Projection (SEP) module and a Private Expert Projection (PEP) module, which align VFM features to the detector feature space while decoupling shared image features from private image features in the VFM feature map. Finally, we propose an MoE module to transfer the VFM's generalization to the detector without altering the detector's original architecture. Furthermore, our method extends well-trained detectors to detect novel classes in unseen domains without re-training on the base classes.

artificial intelligence, localization capability, machine learning, (11 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)
